
    Exploiting Homogeneity of Density in Incremental Hierarchical Clustering

    Hierarchical clustering is an important tool in many applications. When it involves a large data set that grows over time, periodically reclustering the whole data set is inefficient. Therefore, the ability to incorporate new data incrementally into an existing hierarchy becomes increasingly important. This article describes HOMOGEN, a system that employs a new algorithm for generating a hierarchy of concepts and clusters incrementally from a stream of observations. The system aims to construct a hierarchy that satisfies the homogeneity and monotonicity properties. Working in a bottom-up fashion, it places each new observation in the hierarchy and performs a sequence of restructuring processes only in regions affected by the new observation. Additionally, it combines multiple restructuring techniques that address different restructuring objectives to obtain a synergistic effect. The system has been tested on a variety of domains, including structured and unstructured data sets. The experimental results reveal that the system constructs a concept hierarchy that is consistent regardless of the input data order and whose quality is comparable to that of hierarchies produced by non-incremental clustering algorithms.
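
    The placement-and-local-restructuring idea can be illustrated with a short Python sketch. This is a hypothetical simplification, not the actual HOMOGEN algorithm: a new observation descends to the closest leaf, and only the nodes along that path are marked for restructuring.

    import numpy as np

    class Node:
        def __init__(self, point=None):
            self.point = point    # set for leaves, None for internal nodes
            self.children = []

        def centroid(self):
            if self.point is not None:
                return self.point
            return np.mean([c.centroid() for c in self.children], axis=0)

    def insert(root, x):
        """Attach observation x near its closest region and return the
        path of affected nodes, the only ones needing restructuring."""
        x = np.asarray(x, dtype=float)
        node, path = root, [root]
        while node.children:
            node = min(node.children,
                       key=lambda c: np.linalg.norm(c.centroid() - x))
            path.append(node)
        if node.point is None:    # empty tree: root becomes a leaf
            node.point = x
        else:                     # split the leaf into an internal node
            node.children = [Node(node.point), Node(x)]
            node.point = None
        return path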

    Aspect-Based Sentiment Analysis Approach with CNN

    A great deal of research has been done in the domain of sentiment analysis, for example the work by Bing Liu (2012) [1]. In the SemEval competitions, sentiment analysis research has been extended further, to the aspect level, commonly called Aspect-Based Sentiment Analysis (ABSA) [2]. The ABSA problems posed by SemEval are quite diverse and arise mostly from the real data provided; they include implicit aspects, multi-label instances, out-of-vocabulary (OOV) words, expression extraction, and the detection of aspects and polarities. This research focuses only on aspect classification and sentiment classification. It builds on the existing Convolutional Neural Network (CNN) method, reintroduced by Alex K., whose work reduced the error rate by 15%, compared to a decrease of only 5% the previous year. We propose an optimized CNN that uses a threshold (CNN-T) to select the best data in the training set. This method can produce more than one aspect from a single test instance. On average, CNN-T obtained a better F-measure than CNN and three classic machine learning methods, i.e., SVM, Naive Bayes, and KNN. The overall F1 score of CNN-T is 0.71, which is greater than that of the other methods.
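
    The thresholding idea can be sketched in Python with Keras; the layer sizes and the 0.5 cut-off below are illustrative assumptions, not the paper's configuration. Because each aspect receives an independent sigmoid score, a single test sentence can yield more than one aspect label.

    import numpy as np
    import tensorflow as tf

    VOCAB, N_ASPECTS, THRESHOLD = 10000, 5, 0.5

    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(VOCAB, 128),
        tf.keras.layers.Conv1D(64, 3, activation="relu"),
        tf.keras.layers.GlobalMaxPooling1D(),
        tf.keras.layers.Dense(N_ASPECTS, activation="sigmoid"),  # one unit per aspect
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")

    def predict_aspects(token_ids):
        """Return every aspect whose score clears the threshold."""
        probs = model.predict(token_ids, verbose=0)
        return [np.where(row >= THRESHOLD)[0].tolist() for row in probs]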

    Rhetorical Sentences Classification Based on Section Class and Title of Paper for Experimental Technical Papers

    Rhetorical sentence classification is an interesting approach for making extractive summaries, but the technique still needs development because the performance of automatic rhetorical sentence classification remains poor. Rhetorical sentences are sentences that contain rhetorical words or phrases; they appear not only in the body of a paper but also in its title. In this study, features related to the section class and title class proposed in previous research were further developed. Our method uses different techniques to achieve automatic section class extraction, for which we introduce new, format-based features. Furthermore, we propose automatic rhetorical phrase extraction from the title. The corpus we used was a collection of technical-experimental scientific papers. Our method uses the Support Vector Machine (SVM) and Naïve Bayes algorithms to classify sentences into four categories: Problem, Method, Data, and Result. It was hypothesized that these features would improve classification accuracy compared to previous methods; the F-measure for these categories improved by up to 14%.
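
    A minimal Python sketch of this kind of classifier, assuming a simplified stand-in feature set (the paper's format-based features are not reproduced here): each sentence is represented by its words, its section class, and the words it shares with the title, then classified with a linear SVM.

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    def featurize(sentence, section_class, title_words):
        feats = {"section=" + section_class: 1.0}
        for w in sentence.lower().split():
            feats["w=" + w] = 1.0
            if w in title_words:          # rhetorical cue shared with the title
                feats["title=" + w] = 1.0
        return feats

    train = [
        (featurize("we propose a new classifier", "method", {"classifier"}), "Method"),
        (featurize("the corpus contains 500 papers", "data", {"papers"}), "Data"),
    ]
    X, y = zip(*train)
    clf = make_pipeline(DictVectorizer(), LinearSVC()).fit(X, y)
    print(clf.predict([featurize("we propose a novel method", "method", set())]))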

    Feature Expansion for Sentiment Analysis in Twitter

    The community's need for social media keeps increasing, since such media can be used to express opinions, especially on Twitter. Sentiment analysis can be used to understand public opinion on a topic, and its accuracy can be measured and improved by several methods. In this paper, we introduce a hybrid method that combines (a) basic features with feature expansion based on Term Frequency-Inverse Document Frequency (TF-IDF) and (b) basic features with feature expansion based on tweet-based features. We train the three most common classifiers in this field, i.e., Support Vector Machine (SVM), Logistic Regression (Logit), and Naïve Bayes (NB). Of the two feature expansions, we observe a significantly larger increase with tweet-based features than with TF-IDF; the highest accuracy of 98.81% is achieved with the Logistic Regression classifier.
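
    The hybrid can be sketched in a few lines of Python; the three tweet-based features here (hashtag, mention, and exclamation counts) are assumptions for illustration, not the paper's feature list.

    from scipy.sparse import csr_matrix, hstack
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    tweets = ["great phone! #happy", "worst service ever @telco"]
    labels = [1, 0]   # 1 = positive sentiment

    def tweet_features(t):
        return [t.count("#"), t.count("@"), t.count("!")]

    # concatenate TF-IDF features with the tweet-based block
    tfidf = TfidfVectorizer()
    X = hstack([tfidf.fit_transform(tweets),
                csr_matrix([tweet_features(t) for t in tweets])])
    clf = LogisticRegression().fit(X, labels)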

    Measuring information credibility in social media using combination of user profile and message content dimensions

    Information credibility in social media is becoming a crucial aspect of information sharing in society. The literature shows that existing approaches do not label information credibility based on user competencies and the topics users post about. This study improves credibility measurement by adding 17 new features for Twitter and 49 features for Facebook. In the first step, we perform a labeling process based on user competencies and their posted topics to classify users into two groups, credible and non-credible, with respect to their posted topics. These approaches are evaluated on ten thousand samples of real-field data obtained from the Twitter and Facebook networks, using Naive Bayes (NB), Support Vector Machine (SVM), Logistic Regression (Logit), and J48 classifiers. With the proposed new features, the measured credibility of information provided in social media increases significantly, as indicated by better accuracy compared to the existing technique for all classifiers.
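
    A minimal Python sketch of the second step, with three stand-in profile/content features in place of the paper's 17 (Twitter) or 49 (Facebook) features; the values and labels are invented for illustration.

    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB

    # columns: follower/following ratio, account age (years), topic-word overlap
    X = [[5.2, 7.0, 0.8],
         [0.1, 0.2, 0.1],
         [3.0, 4.5, 0.6],
         [0.3, 0.5, 0.2]]
    y = [1, 0, 1, 0]   # 1 = credible on the posted topic

    print(cross_val_score(GaussianNB(), X, y, cv=2).mean())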

    Ranking the Online Documents Based on Relative Credibility Measures

    Information searching is the most popular activity on the Internet, and search engines usually rank results by relevance. However, for purposes that concern information credibility, particularly citing information for scientific works, another approach to ranking search engine results is required. This paper presents a study on developing a new ranking method based on the credibility of information. The method is built upon two well-known algorithms, PageRank and citation analysis. In the experiment, the Spearman rank correlation coefficient was used to compare the proposed rank (generated by the method) with a standard rank (generated manually by a group of experts); on average, the coefficient satisfied 0 < rS < the critical value, meaning a positive correlation was found but it was not statistically significant. Hence the proposed rank does not yet match the standard, but the performance could be improved.
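
    The comparison step is standard and can be reproduced directly in Python; the rank positions below are invented for illustration.

    from scipy.stats import spearmanr

    proposed = [1, 2, 3, 4, 5]   # rank positions from the method
    expert   = [2, 1, 4, 3, 5]   # rank positions from the experts

    rs, p = spearmanr(proposed, expert)
    print(rs, p)   # a positive rs below the critical value indicates a
                   # correlation that is not statistically significant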

    Factors Influencing User’s Adoption of Conversational Recommender System Based on Product Functional Requirements

    A conversational recommender system (CRS) helps customers find products fitted to their needs through repeated interaction. When customers want to buy products with many high-tech features (e.g., cars, smartphones, notebooks), most users are not familiar with the products' technical features. A more natural way to elicit customers' needs is to ask what they actually want to do with the product (which we call product functional requirements). In this paper, we analyze four factors, i.e., perceived usefulness, perceived ease of use, trust, and perceived enjoyment, associated with users' intention to adopt the interaction model (in CRS) based on product functional requirements. The results of an experiment using the technology acceptance model (TAM) indicate that, for users who are not familiar with technical features, perceived usefulness is the main factor influencing adoption. Meanwhile, for users who are familiar with the technical features of a product, perceived enjoyment plays a role in their intention to adopt this interaction model.
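
    TAM studies commonly test such relationships with regression-style models; the Python sketch below is a simple stand-in (not the paper's analysis), regressing adoption intention on the four factors with invented Likert-scale scores.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # columns: perceived usefulness, ease of use, trust, enjoyment (1-5 Likert)
    X = np.array([[5, 4, 4, 3],
                  [4, 4, 3, 4],
                  [2, 3, 2, 2],
                  [3, 2, 3, 5]])
    intention = np.array([5, 4, 2, 4])

    model = LinearRegression().fit(X, intention)
    print(dict(zip(["usefulness", "ease", "trust", "enjoyment"], model.coef_)))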

    A Modern Physics Virtual Laboratory Model for Improving the Generic Science Skills of Pre-service Teachers

    We have developed a virtual laboratory for teaching modern physics. The purpose of this study is to examine the effectiveness of a modern physics virtual laboratory model on students' generic science skills. The study involved 64 students who were divided into two groups, an experimental group and a control group. The research instrument was a generic science skills test integrated with mastery of modern physics concepts. Data were analyzed using a mean-difference test and normalized gain scores. The results showed an increase in generic science skills in both groups; the largest gains were in logical inference capability and the ability to build concepts. These results indicate that the modern physics virtual laboratory model is effective in enhancing students' generic science skills.
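
    Both analyses named here are standard and can be reproduced in a few lines of Python. The normalized gain is Hake's g = (post - pre) / (100 - pre); reading the mean-difference test as an independent-samples t-test is an assumption, and the scores below are invented.

    import numpy as np
    from scipy.stats import ttest_ind

    pre_exp, post_exp = np.array([40., 50, 45]), np.array([75., 80, 70])
    pre_ctl, post_ctl = np.array([42., 48, 44]), np.array([60., 62, 58])

    def normalized_gain(pre, post):
        return (post - pre) / (100.0 - pre)

    g_exp = normalized_gain(pre_exp, post_exp)
    g_ctl = normalized_gain(pre_ctl, post_ctl)
    print(g_exp.mean(), g_ctl.mean(), ttest_ind(g_exp, g_ctl).pvalue)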

    CT-FC: more Comprehensive Traversal Focused Crawler

    In today's world, people depend increasingly on information from the WWW, including professionals who have to analyze data in their domain to maintain and improve their business. Such data analysis requires information that is comprehensive and relevant to the domain. A focused crawler, a topic-based Web indexer agent, is used to meet this information need. In increasing precision, focused crawlers face the problem of low recall. Studies of the WWW hyperlink structure indicate that many Web documents are not strongly connected but are reachable through co-citation and co-reference. Conventional focused crawlers that use a forward crawling strategy cannot visit documents with these characteristics. This study proposes a more comprehensive traversal framework. As a proof of concept, CT-FC (a focused crawler with the new traversal framework) was run on DMOZ data, which is representative of WWW characteristics. The results show that this strategy can increase recall significantly.
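
    A conceptual Python sketch of such a traversal (not CT-FC's actual algorithm): besides forward out-links, the frontier is also fed with co-citation siblings, reached through a hypothetical get_backlinks() source such as a link index.

    from collections import deque

    def crawl(seeds, relevant, get_outlinks, get_backlinks, limit=1000):
        """Focused crawl that expands forward links plus co-citation
        siblings (other pages linked by a common citing parent)."""
        frontier, seen, hits = deque(seeds), set(seeds), []
        while frontier and len(seen) < limit:
            url = frontier.popleft()
            if not relevant(url):
                continue
            hits.append(url)
            candidates = list(get_outlinks(url))         # forward step
            for parent in get_backlinks(url):            # backward step
                candidates.extend(get_outlinks(parent))  # co-cited siblings
            for nxt in candidates:
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
        return hits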